Duplicate based schema matching

Author

Alexander Bilke

Abstract

The integration of independently developed data sources poses many problems, which result from several types of heterogeneity. One of the most daunting challenges is schema matching, the semi-automatic process of detecting semantic relationships between attributes in heterogeneous schemata. Various solutions that exploit schema information or extract specific features from attribute values have been described. In this thesis we propose novel schema matching algorithms that exploit fuzzy duplicates, i.e., different representations of the same real-world entity.

We describe the DUMAS table matcher, whose goal is to establish attribute correspondences between two tables. Finding the duplicates that can be used for schema matching is a challenging task because the semantic relationships between the tables are unknown, and thus existing duplicate detection solutions cannot be applied. We discuss the novel problem of duplicate detection in unaligned relations and describe an algorithm that is able to detect the top-k duplicates. The attribute correspondences between the two tables are extracted from those duplicates in a subsequent step.

The DUMAS schema matcher extends the duplicate-based matching approach to complex schemata consisting of multiple tables. Finding attribute correspondences between complex schemata poses several new challenges that do not occur when single tables are to be matched, and these challenges complicate the application of the table matcher. We describe heuristics used to determine whether a table matching can be trusted, and develop an algorithm that exploits multitable duplicates to detect correspondences between complex schemata.

The previous two algorithms are restricted to simple (i.e., 1:1) correspondences. Because complex (i.e., 1:n or m:n) correspondences do occur in practice, we developed the DUMAS complex matcher. This matcher uses the result of the DUMAS table matcher and improves the matching by merging certain attributes, thereby detecting complex correspondences. Because the space of possible complex matchings is very large, we devised several heuristics to decrease the number of attribute combinations that have to be considered.
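
The two-step idea behind the table matcher (first find likely duplicates without knowing how the columns align, then read 1:1 attribute correspondences off those duplicates) can be illustrated with a small Python sketch. It is not the DUMAS implementation: the token-overlap score, the use of `difflib` for field similarity, the brute-force assignment, and all function names and sample rows are assumptions chosen only to keep the example short and self-contained.

```python
from difflib import SequenceMatcher
from itertools import permutations


def tokens(row):
    """Treat a whole row as a bag of lowercase tokens, ignoring column boundaries."""
    return {tok.lower() for value in row for tok in str(value).split()}


def top_k_duplicates(table_a, table_b, k):
    """Return the k highest-scoring (score, i, j) row pairs by Jaccard token overlap.

    Column alignment is unknown at this point, so rows are compared as
    unstructured text (an assumed stand-in for the thesis's duplicate detector).
    """
    scored = []
    for i, row_a in enumerate(table_a):
        tok_a = tokens(row_a)
        for j, row_b in enumerate(table_b):
            tok_b = tokens(row_b)
            union = tok_a | tok_b
            score = len(tok_a & tok_b) / len(union) if union else 0.0
            scored.append((score, i, j))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def field_similarity(x, y):
    """Fuzzy string similarity in [0, 1] between two field values."""
    return SequenceMatcher(None, str(x).lower(), str(y).lower()).ratio()


def match_attributes(table_a, table_b, duplicates):
    """Derive a 1:1 column mapping {column_in_a: column_in_b} from duplicate pairs.

    Field similarities are summed over all duplicate pairs; the best assignment
    is then found by brute force, which is only feasible for a handful of
    columns and assumes table_a has no more columns than table_b.
    """
    n_a, n_b = len(table_a[0]), len(table_b[0])
    sim = [[0.0] * n_b for _ in range(n_a)]
    for _, i, j in duplicates:
        for a in range(n_a):
            for b in range(n_b):
                sim[a][b] += field_similarity(table_a[i][a], table_b[j][b])
    best = max(
        permutations(range(n_b), n_a),
        key=lambda perm: sum(sim[a][perm[a]] for a in range(n_a)),
    )
    return {a: best[a] for a in range(n_a)}


# Hypothetical sample data: same entities, different column order and formatting.
table_a = [  # columns: name, city, phone
    ("John Smith", "Berlin", "030-1234567"),
    ("Mary Jones", "Hamburg", "040-7654321"),
    ("Peter Klein", "Munich", "089-5550000"),
]
table_b = [  # columns: phone, name, city
    ("040-7654321", "Jones, Mary", "Hamburg"),
    ("030-1234567", "J. Smith", "Berlin"),
    ("0221-999888", "Anna Weber", "Cologne"),
]

duplicates = top_k_duplicates(table_a, table_b, k=2)
mapping = match_attributes(table_a, table_b, duplicates)
print(mapping)
```

In this sketch only the two strongest row pairs feed the column comparison, which mirrors the abstract's point that a few reliable duplicates are enough to suggest attribute correspondences. Complex (1:n or m:n) correspondences are outside its scope.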

Similar resources

An Improved Semantic Schema Matching Approach

Schema matching is a critical step in many applications, such as data warehouse loading, Online Analytical Processing (OLAP), data mining, the semantic web [2], and schema integration. The task is to find the semantic correspondences between elements of two schemas. Recently, schema matching has attracted considerable interest in both research and practice. In this paper, we present a new impr...

Eliminating NULLs with Subsumption and Complementation

In a data integration process, an important step after schema matching and duplicate detection is data fusion. It is concerned with the combination or merging of different representations of one real-world object into a single, consistent representation. In order to solve potential data conflicts, many different conflict resolution strategies can be applied. In particular, some representations ...

Record Matching Over Query Results Using Fuzzy Ontological Document Clustering

Record matching is an essential step in duplicate detection, as it identifies records representing the same real-world entity. Supervised record matching methods require users to provide training data and therefore cannot be applied to web databases, where query results are generated on the fly. To overcome this problem, a new record matching method named Unsupervised Duplicate Elimination (UDE) is p...

A Semi Automatic Tool For Schema Mapping

neric mapping framework at the schema level to address the problem of schema interoperability Providing a formalism for developing a generic, extensible, and semi-automated mapping A semi-automatic tool for schema mapping. at the University of Washington in Seattle, where he founded the database group. on Clio, the first semi-automatic tool for heterogeneous schema mapping. Keywords: data integ...

Finding nontrivial semantic matches between database schemas

Automation of schema matching has been under investigation for several decades, yet existing systems usually fail to find all matches or suggest incorrect ones. Because of this imperfection, schema matching is still often done manually by domain experts. The rapidly increasing number of heterogeneous and distributed data...

Publication date: 2006
